In this article we introduce a novel stochastic Hebb-like learning rule for neural networks that is neurobiologically motivated. This learning rule combines features of unsupervised (Hebbian) and supervised (reinforcement) learning and is stochastic with respect to the selection of the time points at which a synapse is modified. Moreover, the learning rule affects not only the synapse between pre- and postsynaptic neuron, which is called homosynaptic plasticity, but also further remote synapses of the pre- and postsynaptic neuron. This more complex form of synaptic plasticity has recently come under investigation in neurobiology and is called heterosynaptic plasticity. We demonstrate that this learning rule is useful in training neural networks by learning parity functions, including the exclusive-or (XOR) mapping, in a multilayer feed-forward network. We find that our stochastic learning rule works well, even in the presence of noise. Importantly, the mean learning time increases polynomially with the number of patterns to be learned, indicating efficient learning.

Keywords: Hebb-like learning; neural networks; biological reinforcement learning; heterosynaptic plasticity



1. Introduction 

What are the laws that regulate learning on a neuronal level in animals or humans? So far this important question is open. However, the common picture of a biological learning rule is that the synaptic weights are changed according to a local rule. In the context of neural networks, local means that only the neurons adjacent to a synapse contribute to changes of the synaptic weight. Such a mechanism for synaptic strengthening was proposed by Donald Hebb in 1949 and found experimentally by T. Bliss and T. Lomo. In biological terms Hebbian learning is called long-term potentiation (LTP). Experimentally as well as theoretically there is a large body of work aiming to formulate precise conditions under which learning in neural networks takes place. For example, the influence of the precise timing of pre- and postsynaptic neuron firing and the duration of a synaptic change, termed short- or long-term plasticity, have been studied extensively. All of these contributions share the locality condition proposed by Hebb. In this article we present a novel stochastic Hebb-like learning rule inspired by experimental findings about heterosynaptic plasticity. This form of neural plasticity affects not only the synapse between pre- and postsynaptic neuron in which a synaptic modification was induced, but also further remote synapses of the pre- and postsynaptic neuron. Additionally, we demonstrate that this learning rule can be successfully applied to train multilayer neural networks.

This paper is organized as follows. In section 2 we motivate our learning rule by summarizing experimental observations concerning synaptic plasticity and properties of biological and artificial neural networks, as far as they are useful for a better understanding of our learning rule. In section 3 we propose our learning rule and give a mathematical definition. We investigate our learning rule in section 4 by numerical simulations. In section 5 we discuss our stochastic learning rule and compare it with other learning rules. This article ends in section 6 with a conclusion and an outlook on further investigations.

2. Overview of biological and artificial learning in neural networks 

One property that all neural networks have in common, biological as well as artificial, is that two different processes take place simultaneously. The first process concerns signal processing and the second learning. Signal processing is reflected by the time-dependent activity $x_i(t)$ of a neuron $i$, whereas learning concerns the dynamical behavior of the synaptic weights $w_{ij}(t)$ between two neurons $i$ and $j$ in the network. One major difference between the two dynamics is that they occur on different timescales: normally, learning is much slower than the neural activity. Although we focus in this article on the learning dynamics, we cannot neglect the neural activity, because both processes are coupled and influence each other.

Figure 1 shows a schematic neural network consisting of 12 neurons. The synapses are not drawn directly from neuron to neuron but in two pieces, which depicts the synaptic cleft of chemical synapses. The reason for this will become clearer when we describe our learning rule below. The left panel shows a signal path within a feed-forward network involving the neurons $n_2$, $n_6$, $n_8$, $n_{12}$ and the synapses between these neurons, $w_{26}$, $w_{68}$, $w_{8,12}$. In this and all following figures we suppose that the signal flow and, hence, the orientation of the path is from top to bottom. The neurons (synapses) that were actively involved in this signal processing are drawn as black circles (full lines). Concerning this information flow, Frey et al. found in the hippocampus of rats in vivo that there is a synaptic tagging mechanism. This mechanism tags synapses that were repeatedly involved in information processing within a certain time window of up to 1.5 hours. If one of these synapses is restimulated within this time interval then a synaptic modification



6, 2008 21:20 WSPC/INSTRUCTION FILE emmert 



A Hetero synaptic Learning Rule for Neural Networks 3 




Fig. 1. Schematic depiction of a feed-forward neural network with time direction from top to bottom. Left: Visualization of the synaptic tagging mechanism found experimentally by Frey et al. Right: Homosynaptic plasticity induced by the simultaneous activity of neurons $n_6$ and $n_8$ within a certain time window.

is induced. One can interpret this as a kind of echo or memory of past activity within the neural network. Hence, the left panel of Fig. 1 can be interpreted such that the depicted path from neuron $n_2$ to $n_{12}$ is not the actual information flow, but the reflection of recent past activity, which the neurons and synapses can remember by an additional degree of freedom.

Suppose now that this signal flow caused a synaptic modification of $w_{68}$, as depicted in the right panel of Fig. 1. This situation corresponds to so-called Hebbian learning. Necessary conditions for this kind of learning are that the neurons surrounding the synapse were both active within a certain time window, which is in the ms range, and that the presynaptic neuron fires before the postsynaptic neuron. In biological terms Hebbian learning is also called long-term potentiation (LTP), because it strengthens the synaptic weight, in contrast to long-term depression (LTD), which weakens the synaptic weight if the order of the spike times of pre- and postsynaptic neuron is reversed. However, both kinds of learning, LTP as well as LTD, have one thing in common: they are homosynaptic with respect to the number of synapses that are changed.

Recently, an increasing number of experimental studies have investigated a new form of synaptic modification, so-called heterosynaptic plasticity. In contrast to homosynaptic plasticity, where only the synapse between the active pre- and postsynaptic neuron is changed, heterosynaptic plasticity also concerns further remote synapses of the pre- and postsynaptic neuron. This scenario is depicted in the left panel of Fig. 2. We suppose again that the synapse $w_{68}$ was changed either by LTP or LTD. Fitzsimonds et al. found in cultured hippocampal neurons that the induction of LTD in $w_{68}$ is also accompanied by back-propagation of depression in the dendritic tree of the presynaptic neuron. Furthermore, depression also propagates laterally
Fig. 2. Left: Visualization of heterosynaptic plasticity found experimentally by Fitzsimonds et al. The neurons (synapses) that are affected by heterosynaptic plasticity induced in $w_{68}$ are drawn as black circles (full lines). Right: Otmakova et al. found that neurons in the CA1 region of the hippocampus receive a global reinforcement signal in the form of dopamine.

in the pre- and postsynaptic neuron. Similar results hold for the propagation of LTP. These experimental findings are depicted in the left panel of Fig. 2. We emphasize all synapses whose weights are changed ($w_{58}$, $w_{69}$, $w_{26}$, $w_{36}$) and all neurons that enclose these synapses by drawing full lines and black circles, respectively. A direct comparison between the left panel of Fig. 2, which depicts heterosynaptic plasticity, and the right panel of Fig. 1, which depicts homosynaptic plasticity, reveals the large difference in the number of affected synapses and the star-like spread of plasticity to some of the synapses connected with the two neurons that were responsible for the induction of plasticity in synapse $w_{68}$. We want to emphasize explicitly that Fitzsimonds et al. have so far found no forward-propagated postsynaptic plasticity. This would correspond to the synapses $w_{8,11}$, $w_{8,12}$ of neuron $n_8$, which are drawn as dotted lines in the left panel of Fig. 2. A biological explanation for the cellular mechanisms behind these findings is currently under investigation. Fitzsimonds et al. suggest the existence of retrograde signaling from the post- to the presynaptic neuron, which could produce a secondary cytoplasmic factor for back-propagation and presynaptic lateral spread of LTD. On the postsynaptic side, lateral spread of LTD could be explained similarly under the assumption that there is a blocking mechanism for the cytoplasmic factor which prevents forward-propagated LTD. They are of the opinion that extracellular diffusible factors are of minor importance.

The experiments of Fitzsimonds et al. are certainly an extension of homosynaptic learning, which we denote briefly as Hebbian learning^a, but nevertheless both principles can be characterized as unsupervised learning, because both learning types use exclusively local information available in the neural system. This is in contrast to the famous back-propagation learning rule for artificial neural networks. The back-propagation algorithm is famous because until the 1980s no systematic method was known to adjust the synaptic weights of an artificial multilayer (feed-forward) network to learn a mapping^b. Still, the problem with the back-propagation algorithm is that it is not biologically plausible, because it requires the back-propagation of an error in the network. We emphasize that the problem is not the back-propagation process itself, because, e.g., heterosynaptic plasticity could provide such a mechanism, as depicted in the left panel of Fig. 2, but the knowledge of the error, which cannot be known explicitly to the neural network. For this reason learning by back-propagation is classified as supervised learning or learning by a teacher. However, there is a modified form of supervised learning, namely reinforcement learning, that is biologically plausible. Reinforcement learning reduces the information provided by a teacher to a binary reinforcement signal $r$ that reflects the quality of the network's performance. Interestingly, experimental observations from the hippocampal CA1 region have shown that there is a global signal in the form of dopamine which is fed back to the neurons and thereby modulates LTD. Schematically, this is depicted in the right panel of Fig. 2. In this figure each neuron is connected with an additional edge which represents the feedback of dopamine in the form of a reinforcement signal $r$.

a We are aware that the term Hebbian learning is usually used only in the context of LTP, as explained in the text above. By this parlance we want to emphasize that homosynaptic LTP and LTD are more closely interrelated than, e.g., homo- and heterosynaptic LTP.

Based on the experimental findings by Frey et al. and Otmakova et al., Bak and Chialvo [14, 15] as well as Klemm et al. [16] suggested biologically inspired learning rules for neural networks that combine unsupervised Hebbian (homosynaptic) learning with reinforcement learning. We call this kind of combination of Hebbian and reinforcement learning Hebb-like learning to indicate that the learning rule is different from Hebb's, but nevertheless contains characteristics that are biologically plausible. This includes the extension from purely unsupervised learning to a combination of unsupervised and reinforcement learning. The question that arises now is: How can one construct a Hebb-like learning rule that additionally mimics the learning behavior of heterosynaptic plasticity found by Fitzsimonds et al.? This question will be addressed in the next section.



3. The Definition of the stochastic Hebb-like learning rule 

The working mechanism of the learning rule we suggest is based on the explanation of Fitzsimonds et al. for heterosynaptic plasticity given above. To understand what kind of mathematical formulation is capable of describing 'a secondary cytoplasmic factor' in a qualitative way, we start by emphasizing that a neuron is, from a biological point of view, first of all a cell. The subdivision of a neuron into synapses, soma (cell body) and axon is a model and already reflects the direction of the information flow within the neuron, namely from the synapses (input) to the soma (information processing) to the axon (output). Here, we do not question this model view with respect to the direction of signal processing, but with respect to learning. We see no biological reason why the model of a neuron for signal processing should be the same as the model of a neuron for learning. In Fig. 3 we emphasize the cell character of a neuron by underlaying the contour of the whole neuron in gray. Now our reason for drawing the synapses in an unusual way becomes clear: it automatically emphasizes the cell character of a neuron.

b The back-propagation algorithm was independently developed by Werbos and by Rumelhart et al., but became known to the broad research community after the article by Rumelhart et al.

Suppose now that we assign to each neuron in the network one additional parameter $c_i$, as shown in Fig. 3. We call these parameters $c_i$ neuron counters. The neuron counters modulate the synaptic modification in a way defined in detail below. According to our cell view of the neuron, we assume further that the neuron counters of adjacent neurons, which are connected by synapses, can communicate with each other in an additive way. For example, in Fig. 3 the neuron counters $c_6$ and $c_8$ form a new value $d_{68} = c_6 + c_8$ in synapse $w_{68}$, which we call the approximated synapse counter. By this mechanism we obtain a star-like influence of, e.g., the neuron counters $c_6$ and $c_8$ on all synapses connected with neuron 6 or 8, because either $d_{6k} = c_6 + c_k$ or $d_{k8} = c_k + c_8$ holds and regulates the update of the corresponding synaptic weight $w_{6k}$ or $w_{k8}$, respectively. This situation corresponds qualitatively to the learning behavior of heterosynaptic plasticity, with the difference that we have a fully symmetric learning rule. The communication between adjacent neuron counters can be interpreted by viewing the neuron counters as cytoplasmic factors that are allowed to move freely within the cytoplasm of the corresponding neuron (cell). Because we introduced no blocking mechanism for the forward propagation of the postsynaptic neuron counter, we obtain a fully symmetric communication between adjacent neuron counters.

In the next section, we give a mathematical definition of the qualitative principle for heterosynaptic learning presented above. Unfortunately, there are no experimental data available that would allow us to specify quantitatively the influence of $d_{ij}$ on the corresponding synapse $w_{ij}$. For this reason, we use an ansatz to close this gap and make it plausible.

3.1. Mathematical Definition of the learning rule 

If one assumes that the neuron counters modulate learning, then it is plausible to determine the values of $c_i$ as a function of a reinforcement signal $r$ reflecting the performance of the network qualitatively. In the simplest case, the dynamics of the neuron counters depends linearly on the reinforcement signal:

$$c_i \to c_i' = \begin{cases} \Theta, & c_i - r > \Theta \\ 0, & c_i - r < 0 \\ c_i - r, & \text{otherwise} \end{cases} \tag{1}$$
Fig. 3. Visualization of the symmetry of our learning rule. The cell character of a neuron is emphasized by a schematic contour around the whole neuron. Each neuron in this network has a new degree of freedom $c_i$ that we call the neuron counter. Neuron counters can communicate with each other across synapses. The dynamics of the neuron counters is determined by the global reinforcement signal.

Here, $\Theta \in \mathbb{N}$ is a threshold that restricts the neuron counters to the $\Theta + 1$ possible values $\{0, \ldots, \Theta\}$. The value of $c_i$ reflects the network's performance, but it has only relative and no absolute meaning with respect to the mean network error. This can be seen from the following example^c. Suppose $c = 0$; then it is clear that at least the last output of the network was right, $r = 1$. However, we know nothing about the outputs that occurred before the last one. For example, the following two sequences of reinforcement signals can lead to the same value of the neuron counter, $c = 0$: $r_1 = \{1, -1, 1, -1, 1, -1, 1\}$ and $r_2 = \{1, 1, 1, 1, 1, 1, 1\}$, if the start value is $c = 1$ for $r_1$ and $c = 7$ for $r_2$. Obviously, the estimated mean error differs between the two cases if averaged over the last seven time steps. The crucial point is that the start value of the neuron counter is not available to the neuron and, hence, the neuron cannot directly calculate the mean error of the network. However, we can introduce a simple assumption that allows an estimate of the mean network error. We claim that if $c^1 < c^2$ for one neuron in a network that was trained by two different learning rules^d, then the mean error of network one is lower than that of network two. This may not hold in all cases, but it is certainly true on average. In this way we couple the value of the neuron counter to the mean error of the network. Because this holds only statistically, we will introduce a stochastic rather than a deterministic update rule for the synapses that depends on the neuron counters. In the previous section we said that adjacent neuron counters can communicate if both neurons are connected by a synapse. This gives a new variable

$$d_{ij} = c_i + c_j \tag{2}$$

that we call the approximated synapse counter. We will use the approximated synapse counter as the driving parameter of our stochastic update rule, because its value reflects the performance of the synapse to be updated, and the synapses are the adaptive part of a neural network. Hence, evaluating the approximated synapse counter of a synapse gives us, indirectly, a decision for the update of this synapse. Roughly speaking, the higher the approximated synapse counter of a synapse, the higher should be the probability that the synapse is updated. This intuitively plausible assumption will now be quantified.

c We omit the index for simplicity.

d The superscript here indicates not the neuron in the network, but the learning rule used.
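The counter bookkeeping described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the implementation used for the simulations: it assumes that the linear counter dynamics of Eq. (1) amounts to $c \to c - r$, clipped to $\{0, \ldots, \Theta\}$, which is consistent with the two example sequences in the text. The function names are hypothetical.

```python
def update_counter(c, r, theta):
    """Assumed form of Eq. (1): linear in r, clipped to {0, ..., theta}."""
    return min(max(c - r, 0), theta)


def synapse_counter(c_i, c_j):
    """Approximated synapse counter of Eq. (2): d_ij = c_i + c_j."""
    return c_i + c_j


def run_sequence(c_start, rewards, theta):
    """Apply a sequence of reinforcement signals r in {+1, -1} to one counter."""
    c = c_start
    for r in rewards:
        c = update_counter(c, r, theta)
    return c


theta = 7
r1 = [1, -1, 1, -1, 1, -1, 1]   # alternating right and wrong outputs
r2 = [1, 1, 1, 1, 1, 1, 1]      # only right outputs
# Both histories end at the same counter value although their error rates differ.
print(run_sequence(1, r1, theta), run_sequence(7, r2, theta))
```

Both example histories end at $c = 0$, which illustrates why the counter value alone does not determine the mean error.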

As in [14, 15, 16], only active synapses $w_{ij}$ that were involved in the last signal processing step can be updated, and only if the output of the network was wrong. This is plausible, because it prevents mappings that the neural network has already learned from being destroyed. If $r = -1$, the probability that synapse $w_{ij}$ is updated is given by

$$P_{\Delta w}(w_{ij}) = P(p^c < p^r_{d_{ij}}). \tag{3}$$

This probability has to be calculated for each synapse in the network. We want to emphasize that this requires only local information besides the reinforcement signal. Hence, it is a biologically possible mechanism. If the synapse is actually chosen for an update, the synaptic weight is modified by

$$w_{ij} \to w_{ij}' = w_{ij} - \delta. \tag{4}$$

Here, $\delta$ is a positive constant which determines the amount of synaptic depression. To evaluate the stochastic update condition in Eq. (3), the two auxiliary variables $p^c$ and $p^r_{d_{ij}}$ have to be identified. This is done in the following way:

(1) Calculate the approximated synapse counter

$$d_{ij} = c_i + c_j. \tag{5}$$

(2) Map the value of the approximated synapse counter $d_{ij}$ to $p^r_{d_{ij}}$ by

$$k = 2\Theta + 3 - d_{ij}, \tag{6}$$

$$P^r_k \propto k^{-\tau}, \quad \tau \in \mathbb{R}^+, \quad k \in \{1, \ldots, 2\Theta + 3\}. \tag{7}$$

We call $P^r_k$ the rank ordering probability distribution.

(3) The random variable $p^c$ is drawn from the continuous coin distribution

$$P^c(x) \sim x^{-\alpha}, \quad \alpha \in \mathbb{R}^+, \quad x \in (0, 1]. \tag{8}$$
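Steps (1) to (3) can be sketched as follows. The sketch rests on two assumptions that are not specified in the text: the rank ordering distribution is normalised over $k = 1, \ldots, 2\Theta + 3$, and the coin distribution is given a small lower cutoff `x_min` so that it remains normalisable for $\alpha \ge 1$. All function names are hypothetical.

```python
import math


def rank_probability(d_ij, theta, tau):
    """Eqs. (6)-(7): map d_ij to the rank k and return P^r_k.

    Assumption: P^r_k is normalised over k = 1, ..., 2*theta + 3.
    """
    k_max = 2 * theta + 3
    k = k_max - d_ij
    norm = sum(j ** (-tau) for j in range(1, k_max + 1))
    return k ** (-tau) / norm


def coin_cdf(p, alpha, x_min=1e-6):
    """P(p_c < p) for a density proportional to x**(-alpha) on [x_min, 1].

    Assumption: the lower cutoff x_min is introduced only for this sketch.
    """
    if p <= x_min:
        return 0.0
    p = min(p, 1.0)
    if abs(alpha - 1.0) < 1e-12:
        return math.log(p / x_min) / math.log(1.0 / x_min)
    a = 1.0 - alpha
    return (p ** a - x_min ** a) / (1.0 - x_min ** a)


def update_probability(d_ij, theta, tau, alpha):
    """Eq. (3): probability that an active synapse is depressed when r = -1."""
    return coin_cdf(rank_probability(d_ij, theta, tau), alpha)
```

Under these assumptions, `update_probability` reduces to the rank probability itself for $\alpha = 0$ and `x_min` close to zero, and for fixed $d_{ij}$ it grows with $\alpha$.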
Fig. 4. Update probability $P_{\Delta w}$ as a function of $p^r_{d_{ij}}$. The different curves correspond to different values of the exponent $\alpha$ of the coin distribution. The values are: $\alpha_1 = 0.0$, $\alpha_2 = 0.5$, $\alpha_3 = 0.75$, $\alpha_4 = 1.0$ and $\alpha_5 = 1.75$.

We had three reasons to choose a power law in Eq. (8) for the coin distribution instead of a uniform distribution, which would be the simplest choice. First, we see no evidence that a random number generator occurring in a neural system should favor a uniform distribution. Second, it is highly probable that two different random number generators of the same biological system are not identical. Instead, they could have different parameters; in our case they could have different exponents. In this paper we restrict ourselves to the case of identical random number generators, but our framework can be applied directly to the described scenario. Third, by choosing $\alpha = 0$, the coin distribution in Eq. (8) becomes the uniform distribution. This allows us to investigate how the distance of the coin distribution from a uniform distribution influences the learning behavior of a neural network by studying different values of $\alpha$. We remark that for $\alpha = 0$ the update probability in Eq. (3) simplifies to

$$P_{\Delta w}(w_{ij}) = p^r_{d_{ij}}, \tag{9}$$

because a uniformly distributed $p^c$ satisfies $P(p^c < p) = p$.

Before we present our results in the next section, we want to visualize the stochastic update probability $P_{\Delta w}$. Figure 4 shows the update probability $P_{\Delta w}$ as a function of $p^r_{d_{ij}}$. The different curves correspond to different values of the exponent $\alpha$ of the coin distribution. One can see that the update probability follows the values of $p^r_{d_{ij}}$. This holds for each curve in Fig. 4. That means, the higher the value of $p^r_{d_{ij}}$, the higher the update probability. This is the behavior one would intuitively expect, because high values of $p^r_{d_{ij}}$ correspond to high values of the approximated synapse counter $d_{ij}$, indicating high values of the neuron counters, which correspond to a bad network performance. Moreover, one can see in Fig. 4 that the larger $\alpha$, the higher the update probability for fixed $p^r_{d_{ij}}$. In the limit $\alpha \to \infty$ the update probability equals one for all values of $p^r_{d_{ij}}$. Hence, higher values of the exponent $\alpha$ of the coin distribution result in a higher update probability. That means, $\alpha$ controls the sensitivity with which the update probability depends on $p^r_{d_{ij}}$. Another parameter our stochastic update rule depends on is the exponent $\tau$ of the rank ordering distribution. In Fig. 5 we display $P_{\Delta w}$ as a function of $\tau$ and $d_{ij}$ to visualize its influence on the update probability. The values of the update probability are color-coded: blue corresponds to 0 and red to 1. For the left panel of Fig. 5 we used $\alpha = 0.7$ and for the right panel $\alpha = 1.5$ as exponent of the coin distribution. If $p^r_{d_{ij}} = 0$, no update takes place. For increasing values of the approximated synapse counter and fixed values of $\tau$ one obtains increasing values of the update probability. Moreover, higher values of $\alpha$ lead to higher values of $P_{\Delta w}$. This can be seen by comparing the left and right panels of Fig. 5. Increasing values of $\tau$ result in decreasing values of $P_{\Delta w}$ for fixed $d_{ij}$. To summarize, the stochastic update condition we introduced for a synaptic update depends on six parameters,

$$P_{\Delta w}(w_{ij}) = P_{\Delta w}(r, c_i, c_j, \Theta, \alpha, \tau). \tag{10}$$

From the visualizations in Figs. 4 and 5 we saw that increasing values of $d_{ij}$ and $\alpha$ as well as decreasing values of $\tau$ lead to an increase in the update probability.

Fig. 5. Update probability $P_{\Delta w}$ as a function of $\tau$ and $d_{ij}$ obtained for $\Theta = 4$, shown as a contour plot. The update probability is close to one in the upper right corner and close to zero in the upper left corner. The exponent $\alpha$ of the coin distribution was $\alpha = 0.7$ in the left and $\alpha = 1.5$ in the right figure.

4. Numerical Simulations

For the following simulations we use a three-layer feed-forward network. The neural network consists of $I$ input, $H$ hidden and $O$ output neurons. The neurons of adjacent layers are all-to-all connected with synapses $w_{ij} \in \mathbb{R}$. As neuron model we use binary neurons $x_i \in \{0, 1\}$ for $i \in \{1, \ldots, I + H + O\}$. The network dynamics is regulated by a winner-take-all mechanism, where the inner fields of the neurons
are calculated by

$$h_j = \sum_{i}^{\text{all}} w_{ij} x_i. \tag{11}$$

Here, all means all neurons of the preceding layer. As the active neuron in each layer we choose the neuron with the highest activity,

$$i_{\max} = \operatorname{argmax}_i (h_i), \tag{12}$$

which is set to $x_{i_{\max}} = 1$. All other neurons are set to zero. In this way we enforce a sparse coding. Bak and Chialvo have called this extremal dynamics.
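A minimal sketch of this winner-take-all dynamics in plain Python is given below. It assumes that `w[i][j]` stores the weight of the synapse from neuron `i` of one layer to neuron `j` of the next layer, and that ties are broken in favor of the lowest index, which the text does not specify. The function names are hypothetical.

```python
def inner_fields(x_prev, w):
    """Inner fields of the next layer: h_j = sum_i w[i][j] * x_prev[i]."""
    n_next = len(w[0])
    return [sum(w[i][j] * x_prev[i] for i in range(len(x_prev)))
            for j in range(n_next)]


def winner_take_all(h):
    """Set the neuron with the largest inner field to 1 and all others to 0."""
    i_max = max(range(len(h)), key=lambda i: h[i])  # first maximum wins a tie
    return [1 if i == i_max else 0 for i in range(len(h))]


def forward(x_in, w_ih, w_ho):
    """Propagate a binary input pattern through the hidden and output layers."""
    x_hid = winner_take_all(inner_fields(x_in, w_ih))
    x_out = winner_take_all(inner_fields(x_hid, w_ho))
    return x_hid, x_out


# Two input, two hidden and two output neurons with hand-picked weights.
x_hid, x_out = forward([1, 0], [[0.1, 0.9], [0.8, 0.2]], [[1.0, 0.0], [0.0, 1.0]])
print(x_hid, x_out)
```

Exactly one neuron per layer is active after each step, which is the sparse coding mentioned above.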

The training of the neural network works as follows: We randomly choose one of the possible input patterns and initialize the neurons in the input layer. Then we calculate the activity of the neurons in the subsequent layers according to the network dynamics, Eqs. (11) and (12). If the output of the network is correct we set $r = 1$; otherwise the reinforcement signal is set to $r = -1$. According to Eq. (1) we calculate the new values of the neuron counters for the neurons that were active during the signal processing of the input pattern. If $r = -1$ we apply our stochastic learning rule; otherwise we proceed with the next input pattern. This is repeated until the network has converged.
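One learning step of this procedure might look as follows. The sketch assumes the clipped linear counter update $c \to c - r$ for Eq. (1), takes the update probability as a user-supplied function of $d_{ij}$ (`update_probability` is a hypothetical callable), and draws the depression $\delta$ uniformly from $[0, \delta_0]$. It is an illustration, not the code used for the simulations.

```python
import random


def learning_step(active, r, counters, weights, theta,
                  update_probability, delta_0=0.1, rng=random):
    """Apply one learning step after a pattern has been processed.

    active   -- indices of the neurons that fired, ordered from input to output
    r        -- reinforcement signal, +1 (correct) or -1 (wrong)
    counters -- dict: neuron index -> neuron counter c_i
    weights  -- dict: (i, j) -> weight w_ij of the synapse from i to j
    Returns the list of synapses that were depressed.
    """
    # Assumed form of Eq. (1): c -> c - r, clipped to {0, ..., theta}.
    for i in active:
        counters[i] = min(max(counters[i] - r, 0), theta)
    if r == 1:
        return []  # correct output: no synapse is modified
    depressed = []
    # With winner-take-all dynamics the active synapses form a single path.
    for i, j in zip(active, active[1:]):
        d_ij = counters[i] + counters[j]                  # Eq. (2)
        if rng.random() < update_probability(d_ij):       # Eq. (3)
            weights[(i, j)] -= rng.uniform(0.0, delta_0)  # Eq. (4)
            depressed.append((i, j))
    return depressed


counters = {0: 0, 3: 0, 5: 0}
weights = {(0, 3): 0.5, (3, 5): 0.5}
# A wrong output (r = -1) with update probability one depresses both synapses.
print(learning_step([0, 3, 5], -1, counters, weights, theta=3,
                    update_probability=lambda d: 1.0))
```

Passing the update probability as a callable keeps the step independent of the particular choice of rank ordering and coin distributions.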

The mapping to be learned by the network is the exclusive-or (XOR) function and its higher-dimensional extensions, called the parity problem. One can describe the mappings of the parity problem class as indicator functions for an odd or even number of 1's in the binary input vector $(x^I_1, \ldots, x^I_k)$ of the network. If the number of 1's in the input vector is odd, the output of the network shall be $(x^O_1 = 1, x^O_2 = 0)$; if it is even, $(x^O_1 = 0, x^O_2 = 1)$. In this sense, the exclusive-or (XOR) function is the two-dimensional ($k = 2$) representative of this class. To avoid the case of a zero input vector, which would result in zero activity of subsequent layers, we introduce a bias neuron $x^I_{k+1} = 1$. Here, the index $k$ is given by the exponent of the maximal number of patterns $p = 2^k$ which can be realized by a random binary vector of length $k$. For the following simulations the initial weights of the network were chosen randomly from $[0, 1]$ and the neuron counters were all set to zero. The learning rate $\delta$ was chosen randomly from $[0, \delta_0]$, with $\delta_0 = 0.1$, each time a synaptic modification was induced.
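The pattern set of the $k$-dimensional parity problem, including the bias neuron and the two-neuron target coding described above, can be generated as in the following sketch. The function name `parity_patterns` is hypothetical.

```python
from itertools import product


def parity_patterns(k):
    """Return all p = 2**k input patterns with bias neuron and their targets."""
    patterns = []
    for bits in product((0, 1), repeat=k):
        x_in = list(bits) + [1]              # bias neuron x_{k+1} = 1
        odd = sum(bits) % 2 == 1
        target = (1, 0) if odd else (0, 1)   # odd -> (1, 0), even -> (0, 1)
        patterns.append((x_in, target))
    return patterns


# XOR is the k = 2 member of the parity class.
for x_in, target in parity_patterns(2):
    print(x_in, target)
```

For $k = 2$ this yields the four XOR patterns, each extended by the constant bias input.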

4.1. XOR Function 

We start our investigations by studying the influence of the memory length $\Theta$ of the neuron counters and the exponents $\alpha$ and $\tau$ on the mean ensemble error $E$ of the network's performance while learning the XOR function. The contour plot in Fig. 6 shows the simulation results for $\Theta = 3$ and three neurons in the hidden layer. The mean ensemble error $E$ was obtained by averaging over the independent runs of an ensemble of size 10000 and is displayed at the time steps $t = 500$ (left figure) and $t = 1500$ (right figure) during the learning process. To find the optimal parameter configuration

$$(\Theta^*, \alpha^*, \tau^*) = \operatorname{argmin}_{\Theta, \alpha, \tau} E(\Theta, \alpha, \tau; t), \tag{13}$$

which minimizes the mean ensemble error $E(\Theta, \alpha, \tau; t)$, we keep $\Theta$ fixed and vary $\alpha$ and $\tau$ in the interval $[0.0, 3.0]$ in steps of $10^{-1}$.

Fig. 6. Mean ensemble error $E$ in dependence on $\tau$ and $\alpha$ at the time steps $t = 500$ (left) and $t = 1500$ (right) during the learning process for $\Theta = 3$. The network dynamics was a winner-take-all mechanism and the ensemble size was 10000.

From Fig. 6 one can see that learning takes place in the whole parameter space $(\alpha, \tau)$. Of course there are regions in which learning is much faster than in others, because the resulting update probability of our learning rule, controlled by $(\alpha, \tau)$, is more suitable for the learning task. To investigate the $\Theta$ dependence of our learning rule we repeated these simulations for several values of $\Theta$. The optimal parameter configurations $(\Theta^*, \alpha^*, \tau^*)$ from these simulations can be found in Table 1. From these results one can conclude that there is no single parameter configuration in this three-dimensional parameter space that minimizes $E$. Instead, there exist multiple parameter configurations resulting in almost the same performance with respect to the absolute convergence of the network. Interestingly, from Table 1 one can see that with increasing values of $\Theta$, $\alpha$ also increases but $\tau$ is
Table 1. Minimal mean ensemble error $E_{\min}(\Theta^*, \alpha^*, \tau^*; t)$ obtained with the optimal parameters $\Theta^*$, $\tau^*$ and $\alpha^*$ for the time steps $t = 500$, 1000 and 1500 (left, middle and right column of each group). The ensemble size for each simulation was 10000.

Θ*    τ*                      α*                      E_min
      t=500  t=1000  t=1500   t=500  t=1000  t=1500   t=500  t=1000  t=1500
1     2.1    2.0     2.2      0.6    0.6     0.8      0.099  0.021   0.004
2     1.8    2.1     2.6      0.9    1.1     1.2      0.113  0.122   0.007
3     2.4    2.3     2.3      1.7    1.7     1.7      0.122  0.032   0.009
4     2.2    2.2     2.2      2.8    3.0     3.0      0.112  0.022   0.004
5     1.9    1.9     1.9      2.4    3.0     3.0      0.087  0.018   0.003
Fig. 7. Mean learning time in dependence on the number of neurons in the hidden layer. $\tau$ and $\alpha$ are taken from Table 1 at the time step $t = 1500$ for the corresponding values of $\Theta$. The network dynamics was a winner-take-all mechanism (upper figure) and a noisy winner-take-all mechanism (lower figure). The ensemble size was 10000 for all simulations.

almost constant. Based on our explanation in section 3.1 of the dependence of $P_{\Delta w}$ on $\alpha$ and $\tau$, we can conclude that higher values of $\Theta$ require a higher update probability. This makes sense, because the complexity of the mapping to be learned by the network was not changed; only the memory length of the neuron counters was enlarged. Apparently, a longer memory was not necessary and would by itself lead to worse results, because averaging over a longer time interval is more time consuming. This effect is compensated by the higher value of $\alpha$, which results in more frequent updates. For our subsequent investigations we use the optimal parameter values obtained at





14 Frank Emmert- Streib 

the learning time step $t = 1500$ from Table 1.

Based on these results, we systematically study the dependence of the mean
learning time on the network topology and the network dynamics. In Fig. 7 we
show the mean learning time as a function of the number of neurons H in the
hidden layer. The curves are indexed by the different values of the neuron
counter memory length Θ. In the lower figure we demonstrate the robustness of
these results in the presence of noise η by using a noisy winner-take-all
mechanism as network dynamics, which adds noise η to the inner fields of the
neurons before the neuron with the highest inner field is selected. The noise
was drawn uniformly from [0, η_0] with η_0 = 4j*-. From both figures one can see
that the mean learning time decreases with an increasing number of neurons in
the hidden layer, as expected, with the increase from 3 to 4 neurons having the
biggest effect. This is because destructive path interference, in which already
correctly learned paths in the network are destroyed by a new synaptic
modification, is strongly reduced when additional neurons in the hidden layer
increase the number of possible paths. Increasing the number of neurons beyond
19 has only marginal influence, because a further increase of redundant paths
has no effect. Even in the presence of noise our learning rule is capable of
learning the XOR function. One can clearly see how an increasing number of
neurons in the hidden layer efficiently reduces the effect of noise in the
system.
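
The noisy winner-take-all selection described above can be sketched in a few
lines. This is only an illustration under stated assumptions: the function name
noisy_winner_take_all, the example inner fields and the noise bound eta0 = 0.5
are hypothetical and are not taken from the simulations reported here.

```python
import random

def noisy_winner_take_all(inner_fields, eta0=0.0, rng=random):
    """Return the index of the neuron with the highest (noisy) inner field.

    eta0 is the upper bound of the uniform noise interval [0, eta0];
    eta0 = 0 reduces to the plain winner-take-all mechanism.
    """
    # Add an independent uniform noise term to every inner field
    noisy_fields = [h + rng.uniform(0.0, eta0) for h in inner_fields]
    # Select the neuron whose noisy inner field is largest
    return max(range(len(noisy_fields)), key=noisy_fields.__getitem__)

# Hypothetical inner fields of four hidden neurons (illustrative values only)
fields = [0.20, 0.90, 0.85, 0.10]

print(noisy_winner_take_all(fields))          # plain winner-take-all picks neuron 1

rng = random.Random(0)                        # seeded for reproducibility
winners = [noisy_winner_take_all(fields, eta0=0.5, rng=rng) for _ in range(1000)]
print({i: winners.count(i) for i in sorted(set(winners))})
```

With these values only neurons 1 and 2 can win under noise, because the other
two inner fields stay below 0.90 even with the maximal noise term added.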

4.2. k-dimensional parity functions

In this subsection we study the influence of the number of patterns to be
learned on the mean learning time. We use p = 2^k input patterns, for
k ∈ {1, ..., 6}, and correspondingly I = k + 1 neurons in the input layer
(footnote e) and H = 1500 neurons in the hidden layer. The network dynamics was
again regulated by a winner-take-all mechanism. Our results for the mean
learning times, shown in Fig. 8, are comparable to the results obtained by Bak
and Chialvo, with the difference that they used as many as 3000 neurons in the
hidden layer. Moreover, the mean learning time scales (footnote f) with the
problem size p according to a power law ∼ p^β with exponent β ≈ 1.8. This
demonstrates not only that our stochastic learning rule is able to learn the
problem but also that learning is efficient, because otherwise the mean
learning times would grow exponentially with p.
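
As a minimal sketch of this setup, the following code enumerates the p = 2^k
parity patterns with one additional bias input and estimates a power-law
exponent by a least-squares line in log-log coordinates. The helper names, the
bias value 1 and the synthetic learning times (generated from T = 50 p^1.8, not
taken from any simulation) are assumptions made for illustration only.

```python
import itertools
import math

def parity_patterns(k):
    """All p = 2**k binary inputs of length k, each extended by a bias unit."""
    patterns = []
    for bits in itertools.product((0, 1), repeat=k):
        inputs = list(bits) + [1]      # I = k + 1 inputs; the last one is the bias (assumed value 1)
        target = sum(bits) % 2         # parity of the k input bits
        patterns.append((inputs, target))
    return patterns

def fit_power_law(p_values, t_values):
    """Least-squares line through (log p, log T); returns (prefactor, exponent)."""
    xs = [math.log(p) for p in p_values]
    ys = [math.log(t) for t in t_values]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    beta = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))
    return math.exp(my - beta * mx), beta

p_values = [len(parity_patterns(k)) for k in range(1, 7)]   # 2, 4, ..., 64
# Synthetic learning times following T = 50 * p**1.8 exactly (not simulation data)
t_values = [50.0 * p ** 1.8 for p in p_values]
prefactor, beta = fit_power_law(p_values, t_values)
print(p_values, round(prefactor, 3), round(beta, 3))
```

For k = 2 the generated patterns reproduce the XOR mapping. On a log-log plot a
power law appears as a straight line with slope β, whereas exponential growth
in p would bend upwards.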

e This includes one neuron as bias.

f See the caption of Fig. 8 for the numerical values of β for the three
different curves.

Fig. 8. Mean learning time in dependence on the number of patterns to be
learned. The network consisted of H = 1500 neurons in the hidden layer. The mean
learning time was averaged over an ensemble of size 1000. The symbols correspond
to results obtained from simulations, whereas the lines are the results of a
least mean square fit. The exponents of the power laws are
β = {1.68, 1.84, 2.15} in ascending order of Θ.

4.3. Influence of exponential distributions on the learning behavior

Finally, we investigated the influence of the type of probability distribution
used for the coin and rank ordering distributions. Here, we use an exponential
distribution for both and study the learning behavior. We found significantly
worse results (not shown) compared to the results for the power law presented in
the last section. To understand this, we display in Fig. 9 the update
probability as a function of τ and the approximated synapse counter
$\tilde{c}_{ij}$. One can see that the update probability can take essentially
only two states, zero and one (upper right). That means that P_Δw produces a
rather deterministic update behavior, which is inappropriate because the
information provided by the approximated synapse counters is uncertain. Other
values of α show qualitatively the same results. This demonstrates that the
larger variability provided by a power law distribution is important for a good
learning behavior.

Fig. 9. Contour plot of the update probability P_Δw as a function of τ and
$\tilde{c}_{ij}$ obtained for Θ = 4. The parameter of the coin distribution was
α = 1.0. The coin and the rank ordering distributions were exponential
functions. The update probability is almost always close to zero but increases
rapidly for τ → 3 and $\tilde{c}_{ij}$ → 10.
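
The difference in variability between the two families of distributions can be
made concrete with a small numerical comparison. The sketch below assumes the
generic forms P(k) ∝ k^(-parameter) and P(k) ∝ exp(-parameter · k) over the
values k = 1, ..., 10; these forms, the range and the parameter value 2 are
illustrative assumptions and are not the exact coin and rank ordering
distributions of the model.

```python
import math

def normalized(weights):
    """Scale non-negative weights so that they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]

def power_law(n, exponent):
    """P(k) proportional to k**(-exponent) for k = 1..n."""
    return normalized([k ** (-exponent) for k in range(1, n + 1)])

def exponential(n, rate):
    """P(k) proportional to exp(-rate * k) for k = 1..n."""
    return normalized([math.exp(-rate * k) for k in range(1, n + 1)])

def entropy(probs):
    """Shannon entropy in bits; larger values mean more variability."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

n, parameter = 10, 2.0
p_pow = power_law(n, parameter)
p_exp = exponential(n, parameter)

for name, probs in (("power law", p_pow), ("exponential", p_exp)):
    print(f"{name:12s} P(1) = {probs[0]:.3f}  P({n}) = {probs[-1]:.2e}  "
          f"entropy = {entropy(probs):.2f} bits")
```

Under these assumptions the exponential form concentrates almost all
probability on the smallest values and leaves the tail practically empty, which
is consistent with the nearly deterministic zero-or-one behavior described
above, while the power law keeps a noticeable weight on large values.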

5. Discussion and comparison of learning rules 

Mathematical investigations of biological as well as artificial learning rules
for neural networks have attracted scientists for decades, because of the
importance of the underlying problem and the implications arising from an
understanding of it. We conclude this article by discussing our novel stochastic
Hebb-like learning rule and comparing it with other models introduced so far
that are constrained in a way that makes them biologically plausible.

Bak and Chialvo introduced a learning rule which combines anti-Hebbian learning,
or long-term depression (LTD), with reinforcement learning. Klemm et al.
extended the learning rule of Bak and Chialvo by introducing one additional
degree of freedom for each synapse in the network, which they called the synapse
counter. Moreover, Bosman et al. proposed a learning rule which incorporates
Hebbian learning (LTP), anti-Hebbian learning (LTD) and reinforcement learning.
All these approaches have in common with our learning rule that they utilize a
reinforcement signal as feedback reflecting the current performance of the
network. The use of a reinforcement signal seems not only plausible but
indispensable for learning mappings, because the neural network has to adapt to
its environment by interacting with it; otherwise the animal will die quickly.
As with physical energy (footnote g), it is impossible to generate information
out of nothing in a meaningful way. The reinforcement signal makes a neural
network and, hence, a brain an open system with respect to the flow of
information. This illustrates intuitively the difficulty of the system under
investigation, because open or dissipative systems are far less understood than
closed, e.g., Hamiltonian, systems.

g Perpetuum mobile.

In contrast to our learning rule, all models proposed before are purely
deterministic with respect to the decision whether an update of a synapse shall
take place or not. Additionally, all of these learning rules can only explain
homosynaptic plasticity. We think that, because the neural network is an open
system, it cannot make deterministic decisions which are objective, owing to the
lack of complete information. Of course, one can always search for the best
decision based on the amount of information available in the system. However,
this internal optimality (within the neural network) does not guarantee external
optimality (of the overall network performance). In this article, we took the
point of view that we have incomplete information and, hence, are only able to
provide an update probability indicating a kind of confidence level for this
update based on our incomplete information. Explicitly, this enters our model in
the form of the approximated synapse counters.
For every network topology one can calculate the synapse counters introduced by
Klemm et al. as a function of the neuron counters. However, this normally
results in relations which involve not only the neuron counters enclosing the
synapse but also more remote neuron counters. This can be seen with the help of
Fig. 3. For example, the neuron counter of neuron five can be written as a
linear sum of the synapse counters:

    $c_5 = \tilde{c}_{25} + \tilde{c}_{35}$        (14)
    $c_5 = \tilde{c}_{57} + \tilde{c}_{58}$        (15)

These equations represent a failure conservation for the incoming and outgoing
connections, respectively. If the neuron counter of neuron five is $c_5$, then
the sum of all synapse counters leading to neuron five has to be equal to this
number, because there is no other way information can involve neuron five in
the signal processing. The same holds for the outgoing information, represented
by Eq. (15). In general, such linear failure conservation relations between the
neuron and synapse counters lead to the linear system

    $c^n = M c^s$        (16)

Here, $c^n$ represents the N-dimensional vector of neuron counters and $c^s$ the
S-dimensional vector of synapse counters. The integer-valued N × S matrix M
depends on the network topology. The problem becomes nonlinear if one wants to
obtain the synapse counters as a function of the neuron counters, because the
non-square matrix M in Eq. (16) can only be inverted by calculating a
pseudoinverse to obtain $c^s = M^{-1} c^n$. This is the situation we are facing.
Explicit calculation using the Moore-Penrose pseudoinverse leads to the
statement given above. Hence, a biologically plausible learning rule cannot use
these relations, because this would violate the local information condition in
neural networks. One possibility around this obstacle is to approximate the
synapse counter by the sum of the neuron counters enclosing this synapse,
however, with the additional assumption that the resulting value is viewed in a
probabilistic rather than a deterministic way. Our simulations showed that a
mere addition (or multiplication) of the neuron counters does not lead to
meaningful results at all. Moreover, the probability distributions used also
have a significant influence on the learning dynamics, as demonstrated in the
results section. The fact that power law distributions give significantly better
results than exponential distributions for the coin and rank ordering
distributions corresponds to results of recent investigations of heuristic
optimization strategies. Boettcher et al. demonstrated that the use of power law
distributions in optimization problems, e.g., finding the energy ground states
of spin glasses and graph bi-partitioning, which are both NP-hard optimization
problems, can give better results compared to simulated annealing or genetic
algorithms. They explained this effect by the positive influence of the
inherently large fluctuations within the system, which prevent the system from
being trapped for a long time in local minima of the error function.
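
The non-locality that results from inverting the failure conservation relations
can be checked numerically. The sketch below uses a hypothetical fully connected
2-3-2 feed-forward network, not the network of Fig. 3, and stacks one relation
of the type of Eq. (14) or Eq. (15) per neuron and direction as the rows of M.
This bookkeeping, which lets the hidden neurons appear twice on the left-hand
side, is an assumption made for illustration; the pseudoinverse is computed with
NumPy's np.linalg.pinv.

```python
import numpy as np

# Hypothetical 2-3-2 feed-forward network (not the network of Fig. 3)
inputs, hidden, outputs = [0, 1], [2, 3, 4], [5, 6]
synapses = ([(i, h) for i in inputs for h in hidden]
            + [(h, o) for h in hidden for o in outputs])

# One conservation relation per neuron and direction:
#   ("out", i): counter of neuron i = sum of counters of synapses leaving i
#   ("in",  j): counter of neuron j = sum of counters of synapses entering j
relations = ([("out", i) for i in inputs] + [("in", h) for h in hidden]
             + [("out", h) for h in hidden] + [("in", o) for o in outputs])

M = np.zeros((len(relations), len(synapses)), dtype=int)
for row, (direction, neuron) in enumerate(relations):
    for col, (pre, post) in enumerate(synapses):
        if (direction == "out" and pre == neuron) or \
           (direction == "in" and post == neuron):
            M[row, col] = 1

M_pinv = np.linalg.pinv(M)          # Moore-Penrose pseudoinverse, shape (12, 10)

# Weights with which the neuron counters enter the estimate of synapse (0, 2)
row_02 = M_pinv[synapses.index((0, 2))]
for (direction, neuron), coeff in zip(relations, row_02):
    print(f"{direction:>3} {neuron}: {coeff:+.4f}")
```

In this example the estimate for the synapse between neurons 0 and 2 carries
non-zero weights on the counters of neurons 1, 3 and 4, which do not enclose
that synapse. This is the kind of remote dependence that a strictly local
learning rule cannot use.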

From a biological point of view, the most significant difference between our
stochastic Hebb-like learning rule and the other learning rules is certainly
that our model aims to explain, in a qualitative way, heterosynaptic plasticity,
which has been found experimentally, instead of homosynaptic plasticity. This is
also the major objective of this paper. Hence, a direct comparison between our
model and the other learning rules cannot be given fairly without neglecting or
underestimating significant components of our model. For example, we introduced
one new degree of freedom for each neuron in the form of neuron counters. Bosman
et al. do not rely on this or similar parameters, whereas Klemm et al.
introduced one additional degree of freedom for each synapse. That means that,
in this context, our model has N parameters, the model of Bosman et al. none,
and the model of Klemm et al. kN parameters, where k is the average number of
synapses per neuron in the network. This makes the learning rule of Bosman et
al. minimal in a mathematical sense compared to ours. However, biologically it
cannot describe heterosynaptic plasticity, which makes a comparison in terms of
the number of parameters meaningless. Interestingly, despite the fact that
heterosynaptic plasticity is more complex than homosynaptic plasticity, the
learning rule of Klemm et al. uses k times more parameters than our model. In
general, we think that, due to the almost overwhelming complexity of biological
phenomena, mathematical modeling should always stay in close contact with
experimental findings in order to constrain the model by regularities found in
nature. These constraints can only lead to minimal mathematical models in the
context under consideration, but not beyond.
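
As a small worked example of this parameter count, the following sketch
evaluates N, kN and k for a fully connected three-layer feed-forward network.
The layer sizes (3, 20, 2) are hypothetical, and it is assumed here that every
neuron carries one neuron counter and every synapse one synapse counter.

```python
def counter_parameters(n_input, n_hidden, n_output):
    """Counter parameters in a fully connected three-layer feed-forward network."""
    neurons = n_input + n_hidden + n_output                # N: one neuron counter each
    synapses = n_input * n_hidden + n_hidden * n_output    # kN: one synapse counter each
    return neurons, synapses, synapses / neurons           # (N, kN, k)

# Hypothetical layer sizes chosen only for illustration
n, kn, k = counter_parameters(3, 20, 2)
print(f"neuron counters N = {n}, synapse counters kN = {kn}, k = {k}")
```

For these sizes a rule based on synapse counters needs 100 parameters, while a
rule based on neuron counters needs 25, i.e. k = 4 times fewer.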



6. Conclusions 

We presented a novel stochastic Hebb-like learning rule for neural networks and
demonstrated its working mechanism by learning the exclusive-or (XOR) problem in
a three-layer network. We investigated the convergence behavior by extensive
numerical simulations for three different network dynamics, which all correspond
to biological forms of lateral inhibition. In all cases we found parameter
configurations for Θ, the length of the neuron memory, α, the exponent of the
coin distribution, and τ, the exponent of the rank ordering distribution, which
constitute the Hebb-like learning rule, that yield not only a solution to the
exclusive-or (XOR) problem but also results comparable to those of a learning
rule recently proposed by Klemm, Bornholdt and Schuster. This is remarkable if
one keeps in mind that our learning rule uses fewer parameters than their model:
because the number of neurons is always (much) smaller than the number of
synapses, the same holds for the respective numbers of neuron and synapse
counters used in the learning rules.

An interesting implication of our learning rule and its inherent stochastic
character is that it offers a quantitative biologically plausible explanation of
heterosynaptic plasticity, which is observed experimentally. In addition to the
experimentally observed back-propagation and the pre- and postsynaptic lateral
spread of long-term depression (LTD), our learning rule predicts
forward-propagated postsynaptic LTD for reasons of a symmetric communication
between adjacent neurons. As far as we know, there is no theoretical explanation
of that phenomenon so far, and we are looking forward to new experiments helping
to clarify this important question.